
Fine-Tuning Language Models with Just Forward Passes

Neural Information Processing Systems

Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a prohibitively large amount of memory. Zeroth-order (ZO) methods can in principle estimate gradients using only two forward passes but are theorized to be catastrophically slow for optimizing large models. In this work, we propose a memory-efficient zeroth-order optimizer (MeZO), adapting the classical ZO-SGD method to operate in-place, thereby fine-tuning LMs with the same memory footprint as inference. For example, with a single A100 80GB GPU, MeZO can train a 30-billion parameter model, whereas fine-tuning with backpropagation can train only a 2.7B LM with the same budget. We conduct comprehensive experiments across model types (masked and autoregressive LMs), model scales (up to 66B), and downstream tasks (classification, multiple-choice, and generation). Our results demonstrate that (1) MeZO significantly outperforms in-context learning and linear probing; (2) MeZO achieves comparable performance to fine-tuning with backpropagation across multiple tasks, with up to 12× memory reduction and up to 2× GPU-hour reduction in our implementation; (3) MeZO is compatible with both full-parameter and parameter-efficient tuning techniques such as LoRA and prefix tuning; (4) MeZO can effectively optimize non-differentiable objectives (e.g., maximizing accuracy or F1). We support our empirical findings with theoretical insights, highlighting how adequate pre-training and task prompts enable MeZO to fine-tune huge models, despite classical ZO analyses suggesting otherwise.
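The central loop of a memory-efficient ZO-SGD step can be illustrated in a few lines. The sketch below is a simplified toy version, not the paper's implementation: the variable names, step sizes, and the toy quadratic loss are all illustrative. The key ideas it shows are that the perturbation z is regenerated from a seed instead of stored, the two forward passes evaluate the loss at θ + εz and θ − εz in place, and the resulting scalar drives an SGD-style update.

```python
import numpy as np

def mezo_step(theta, loss_fn, eps=1e-3, lr=1e-2, seed=0):
    """One memory-efficient ZO-SGD step (simplified sketch, not the
    authors' code). The perturbation z is regenerated from the seed
    rather than stored, so peak memory stays at the size of theta."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(theta.shape)

    theta += eps * z                 # perturb in place: theta + eps*z
    loss_plus = loss_fn(theta)       # first forward pass
    theta -= 2 * eps * z             # move to theta - eps*z
    loss_minus = loss_fn(theta)      # second forward pass
    theta += eps * z                 # restore the original theta

    grad_est = (loss_plus - loss_minus) / (2 * eps)  # projected gradient
    theta -= lr * grad_est * z       # SGD update using the scalar estimate
    return theta

# Toy usage: minimize the quadratic ||theta||^2.
theta = np.ones(4)
for step in range(200):
    theta = mezo_step(theta, lambda t: float(t @ t), seed=step)
```

Note that only the seed and the scalar loss difference need to persist between the two passes, which is what keeps the memory footprint at inference level.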


Fine-tuning language models to find agreement among humans with diverse preferences

Neural Information Processing Systems

Recent work in large language modeling (LLMs) has used fine-tuning to align outputs with the preferences of a prototypical user. This work assumes that human preferences are static and homogeneous across individuals, so that aligning to a single "generic" user will confer more general alignment. Here, we embrace the heterogeneity of human preferences to consider a different challenge: how might a machine help people with diverse views find agreement? We fine-tune a 70 billion parameter LLM to generate statements that maximize the expected approval for a group of people with potentially diverse opinions. Human participants provide written opinions on thousands of questions touching on moral and political issues (e.g., "should we raise taxes on the rich?"), and rate the LLM's generated candidate consensus statements for agreement and quality.


Fine-tuning Language Models over Slow Networks using Activation Quantization with Guarantees

Neural Information Processing Systems

Communication compression is a crucial technique for modern distributed learning systems to alleviate their communication bottlenecks over slower networks. Despite recent intensive studies of gradient compression for data parallel-style training, compressing the activations for models trained with pipeline parallelism is still an open problem. In this paper, we propose AQ-SGD, a novel activation compression algorithm for communication-efficient pipeline parallelism training over slow networks.
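The abstract does not spell out the algorithm, but its basic ingredient, quantizing activations before sending them across a slow network, can be sketched generically. The uniform per-tensor scheme below is only a hedged illustration with made-up names; AQ-SGD's actual contribution is compressing activation changes across training steps, with convergence guarantees.

```python
import numpy as np

def quantize(x, bits=4):
    """Uniform per-tensor quantization (generic illustration, not
    AQ-SGD's scheme). Maps floats onto 2**bits integer levels."""
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (2**bits - 1) if hi > lo else 1.0
    q = np.round((x - lo) / scale).astype(np.uint8)  # sent over the wire
    return q, lo, scale

def dequantize(q, lo, scale):
    """Reconstruct an approximation of the original activations."""
    return q.astype(np.float64) * scale + lo

# Toy usage: a 4-bit round trip bounds the error by half a quantization step.
x = np.linspace(-1.0, 1.0, 17)
q, lo, scale = quantize(x, bits=4)
x_hat = dequantize(q, lo, scale)
```

In a pipeline-parallel setting, only `q`, `lo`, and `scale` would cross the network boundary between stages, which is where the communication savings come from.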


On Reinforcement Learning and Distribution Matching for Fine-Tuning Language Models with no Catastrophic Forgetting

Neural Information Processing Systems

The availability of large pre-trained models is changing the landscape of Machine Learning research and practice, moving from a "training from scratch" to a "fine-tuning" paradigm. While in some applications the goal is to "nudge" the pre-trained distribution towards preferred outputs, in others it is to steer it towards a different distribution over the sample space. Two main paradigms have emerged to tackle this challenge: Reward Maximization (RM) and, more recently, Distribution Matching (DM). RM applies standard Reinforcement Learning (RL) techniques, such as Policy Gradients, to gradually increase the reward signal. DM prescribes to first make explicit the target distribution that the model is fine-tuned to approximate.
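As a hedged illustration of the RM paradigm described above, the sketch below runs vanilla REINFORCE, a basic Policy Gradients method, on a two-armed bandit. The rewards, learning rate, and variable names are made up for illustration and are not from the paper; the point is only the shape of the update: sample from the policy, then push the log-probability of the sampled action in proportion to its reward.

```python
import numpy as np

rng = np.random.default_rng(0)

# A two-armed bandit stands in for the fine-tuned model's output space.
logits = np.zeros(2)
rewards = np.array([0.2, 1.0])  # arm 1 yields higher reward

for _ in range(2000):
    p = np.exp(logits) / np.exp(logits).sum()  # softmax policy
    a = rng.choice(2, p=p)                     # sample an action
    r = rewards[a]
    grad = -p.copy()
    grad[a] += 1.0                             # d log p(a) / d logits
    logits += 0.1 * r * grad                   # ascend the expected reward

p = np.exp(logits) / np.exp(logits).sum()      # policy after training
```

Reward Maximization gradually concentrates probability on high-reward outputs; DM instead fixes a target distribution up front and minimizes a divergence to it.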


Devstral: Fine-tuning Language Models for Coding Agent Applications

Rastogi, Abhinav, Yang, Adam, Jiang, Albert Q., Liu, Alexander H., Sablayrolles, Alexandre, Héliou, Amélie, Martin, Amélie, Agarwal, Anmol, Ehrenberg, Andy, Lo, Andy, Roux, Antoine, Darcet, Arthur, Mensch, Arthur, Bout, Baptiste, Rozière, Baptiste, De Monicault, Baudouin, Bamford, Chris, Wallenwein, Christian, Renaudin, Christophe, Lanfranchi, Clémence, Denoix, Clément, Barreau, Corentin, Mizelle, Darius Dabert Devon, Casas, Diego de las, Chane-Sane, Elliot, Fugier, Emilien, Hanna, Emma Bou, Berrada, Gabrielle, Delerce, Gauthier, Guinet, Gauthier, Novikov, Georgii, Neubig, Graham, Lample, Guillaume, Martin, Guillaume, Jaju, Himanshu, Ludziejewski, Jan, Rute, Jason, Delignon, Jean-Malo, Chabran, JeanHadrien, Studnia, Joachim, Barmentlo, Joep, Amar, Jonas, Roberts, Josselin Somerville, Denize, Julien, Saxena, Karan, Yadav, Karmesh, Khandelwal, Kartik, Chandu, Khyathi Raghavi, Jain, Kush, Lavaud, Lélio Renard, Blier, Léonard, Zhao, Lingxiao, Martin, Louis, Saulnier, Lucile, Gao, Luyu, Pellat, Marie, Guillaumin, Mathilde, Felardos, Mathis, Dinot, Matthieu, Darrin, Maxime, Augustin, Maximilian, Seznec, Mickaël, Gupta, Neha, Raghuraman, Nikhil, Duchenne, Olivier, Wang, Patricia, von Platen, Patrick, Saffer, Patryk, Jacob, Paul, Wambergue, Paul, Kurylowicz, Paula, Chagniot, Philomène, Stock, Pierre, Agrawal, Pravesh, Delacourt, Rémi, Soletskyi, Roman, Sauvestre, Romain, Vaze, Sagar, Gandhi, Sanchit, Subramanian, Sandeep, Dalal, Shashwat, Gandhi, Siddharth, Ghosh, Soham, Mishra, Srijan, Aithal, Sumukh, Antoniak, Szymon, Scao, Teven Le, Lavril, Thibaut, Schueller, Thibault, Foubert, Thomas, Robert, Thomas, Wang, Thomas, Lacroix, Timothée, Bewley, Tom, Nemychnikova, Valeriia, Paltz, Victor, Richard, Virgile, Li, Wen-Ding, Marshall, William, Wang, Xingyao, Zhang, Xuanyu, Wan, Yihan, Tang, Yunhao

arXiv.org Artificial Intelligence

We introduce Devstral-Small, a lightweight open-source model for code agents with the best performance among models below 100B parameters. In this technical report, we give an overview of how we designed and developed the model and crafted its specialization in agentic software development. The resulting model, Devstral-Small, is a small 24B-parameter model that is fast and easy to serve. Despite its size, Devstral-Small still attains competitive performance compared to models more than an order of magnitude larger.


Understanding Linear Probing then Fine-tuning Language Models from NTK Perspective

Neural Information Processing Systems

The two-stage fine-tuning (FT) method, linear probing (LP) then fine-tuning (LP-FT), outperforms linear probing and FT alone. This holds true for both in-distribution (ID) and out-of-distribution (OOD) data. One key reason for its success is the preservation of pre-trained features, achieved by obtaining a near-optimal linear head during LP. However, despite the widespread use of large language models, there has been limited exploration of more complex architectures such as Transformers. In this paper, we analyze the training dynamics of LP-FT for classification tasks on the basis of the neural tangent kernel (NTK) theory.
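The two-stage LP-FT procedure can be sketched on a toy classification problem. Everything below (the random feature map, the learning rates, the step counts) is an illustrative assumption rather than the paper's setup; the point is only the structure: stage one trains the linear head on frozen features, and stage two fine-tunes all parameters starting from that near-optimal head, which is what limits the distortion of pre-trained features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a frozen "pretrained" feature map W and a near-separable task.
X = rng.standard_normal((200, 8))
y = (X[:, 0] + 0.1 * rng.standard_normal(200) > 0).astype(float)
W = rng.standard_normal((8, 16)) * 0.5  # stands in for pre-trained weights
head = np.zeros(16)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Stage 1: linear probing (LP) -- only the head moves, features stay frozen.
for _ in range(1000):
    feats = np.tanh(X @ W)
    p = sigmoid(feats @ head)
    head -= 0.5 * feats.T @ (p - y) / len(y)

# Stage 2: fine-tuning (FT) -- all parameters move, starting from the LP head.
for _ in range(500):
    feats = np.tanh(X @ W)
    p = sigmoid(feats @ head)
    g = (p - y) / len(y)                             # logistic-loss gradient
    head -= 0.1 * feats.T @ g
    W -= 0.1 * X.T @ (np.outer(g, head) * (1 - feats**2))

acc = ((sigmoid(np.tanh(X @ W) @ head) > 0.5) == y).mean()
```

Running FT directly from a zero head would instead send large early gradients through `W`; initializing FT from the LP head is exactly the feature-preservation effect the abstract refers to.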


Fine-tuning Language Models for Recipe Generation: A Comparative Analysis and Benchmark Study

Vij, Anneketh, Liu, Changhao, Nair, Rahul Anil, Ho, Theodore Eugene, Shi, Edward, Bhowmick, Ayan

arXiv.org Artificial Intelligence

This research explores the recipe generation task by fine-tuning various very small language models, with a focus on developing robust evaluation metrics and comparing different language models on this open-ended task. The study presents extensive experiments with multiple model architectures, ranging from T5-small (Raffel et al., 2023) and SmolLM-135M (Allal et al., 2024) to Phi-2 (Research, 2023), implementing both traditional NLP metrics and custom domain-specific evaluation metrics. Our novel evaluation framework incorporates recipe-specific metrics for assessing content quality and introduces approaches to allergen substitution. The results indicate that, while larger models generally perform better on standard metrics, the relationship between model size and recipe quality is more nuanced when considering domain-specific metrics. SmolLM-360M and SmolLM-1.7B demonstrate comparable performance despite their size difference before and after fine-tuning, while fine-tuning Phi-2 shows notable limitations in recipe generation despite its larger parameter count. The comprehensive evaluation framework and allergen substitution systems provide valuable insights for future work in recipe generation and broader NLG tasks that require domain expertise and safety considerations.

